Improving Text Simplification Language Modeling Using Unsimplified Text Data
نویسنده
چکیده
In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.
منابع مشابه
Assessing the relative reading level of sentence pairs for text simplification
While the automatic analysis of the readability of texts has a long history, the use of readability assessment for text simplification has received only little attention so far. In this paper, we explore readability models for identifying differences in the reading levels of simplified and unsimplified versions of sentences. Our experiments show that a relative ranking is preferable to an absol...
متن کاملReadability-based Sentence Ranking for Evaluating Text Simplification
We propose a new method for evaluating the readability of simplified sentences through pair-wise ranking. The validity of the method is established through incorpus and cross-corpus evaluation experiments. The approach correctly identifies the ranking of simplified and unsimplified sentences in terms of their reading level with an accuracy of over 80%, significantly outperforming previous resul...
متن کاملNatural Language Processing for Improving Textual Accessibility ( NLP 4 ITA ) Workshop Programme
Analysis of long sentences are source of problems in advanced applications such as machine translation. With the aim of solving these problems in advanced applications, we have analysed long sentences of two corpora written in Standard Basque in order to make syntactic simplification. The result of this analysis leads us to design a proposal to produce shorter sentences out of long ones. In ord...
متن کاملSimplifying metaphorical language for young readers: A corpus study on news text
The paper presents first results of an ongoing project on text simplification focusing on linguistic metaphors. Based on an analysis of a parallel corpus of news text professionally simplified for different grade levels, we identify six types of simplification choices falling into two broad categories: preserving metaphors or dropping them. An annotation study on almost 300 source sentences wit...
متن کاملSentence Alignment Methods for Improving Text Simplification Systems
We provide several methods for sentencealignment of texts with different complexity levels. Using the best of them, we sentence-align the Newsela corpora, thus providing large training materials for automatic text simplification (ATS) systems. We show that using this dataset, even the standard phrase-based statistical machine translation models for ATS can outperform the state-of-the-art ATS sy...
متن کامل